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Abstract 

In biological data, it is often the case that observed data are available 
only for a subset of samples. When a kernel matrix is derived from such 
data, we have to leave the entries for unavailable samples as missing. In 
this paper, we make use of a parametric model of kernel matrices, and 
estimate missing entries by fitting the model to existing entries. The 
parametric model is created as a set of spectral variants of a complete 
kernel matrix derived from another information source. For model fitting, 
we adopt the em algorithm based on the information geometry of positive 
definite matrices. We will report promising results on bacteria clustering 
experiments using two marker sequences: 16S and gyrB. 


1 Introduction 


In kernel machines such as support vector machines (SVM) ( fScholkopf and 
3mola, 2001), objects are represented as a kernel matrix, where n objects are 
represented as an n x n positive semidefinite matrix. Essentially the (i. j) entry 
of the kernel matrix describes the similarity between i-th and j -th objects. Due 
to positive semidefiniteness, the objects can be embedded as n points in an 
Euclidean feature space such that the inner product between two points equals 
to the corresponding entry of kernel matrix. This property enables us to apply 
diverse learning methods (for example, SVM or kernel PCA) without explicitly 
constructing a feature space ( Scholkopf and Smola , 2001 ). 

Biological data such as amino acid sequences, gene expression arrays and 
phylogenetic profiles are derived from expensive experiments (Brown, 1999|). 


Typically initial experimental measurements are so noisy that they cannot be 
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given to learning machines directly. Since high quality data are created by ex¬ 
tensive work of human experts, it is often the case that good data are available 
only for a subset of samples. When a kernel matrix is derived from such incom¬ 
plete data, we have to leave the entries for unavailable samples as missing. We 
call such a matrix an “incomplete matrix”. Our aim is to estimate the missing 
entries, but it is obviously impossible without additional information. So we 
make use of a parametric model of admissible matrices, and estimate missing 
entries by fitting the model to existing entries. 

In this scheme, it is important to define a parametric model appropriately. 
For example, Graepcl ( 2002| ) used the set of all positive definite matrices as a 
model. Although this model worked well when only a few entries are missing, 
this model is too general for our cases where whole columns and rows are missing. 
Thus we need another information source for constructing a parametric model. 
Fortunately, in biological data, it is common that one object is described by two 
or more representations. For example, genes are represented by gene networks 
and gene expression arrays at the same time (Vert and Kanehisa, 2002). Also a 
bacterium is represented by several marker sequences (Yamamoto et al., 2000|) . 


In this paper, we assume that a complete matrix is available from another 
information source, and a parametric model is created by giving perturbations 
to the matrix. We call the complete matrix a “base matrix”. When creating a 
parametic model of admissible matrices from a base matrix, one typical way is 
to define the parametric model as all spectral variants of the base matrix, which 
have the same eigenvectors but different eigenvalues (Hristianini et al., 2002|). 


When several base matrices are available, the weighted sum of these matrices 
would be a good parametric model as well (Lanckriet et al., 2002|). 


In order to fit a parametric model, the distance between two matrices has 
to be determined. A common way is to define the Euclidean distance between 
matrices (for example, the Frobeneous norm) and make use of the Euclidean 
geometry. Recently Vert and Kanehisa (2002) tackled with the incomplete rna- 


trix approximation problem by means of kernel CCA. Also Cristianini et al 
( [2002 ) proposed a similarity measure called “alignment”, which is basically the 
cosine between two matrices. In contrast that their methods are based on the 
Euclidean geometry, this paper will follow an alternative way: we will define the 
Kullback-Leibler (KL) divergence betw een two kernel mat rices and make use of 
the Riemannian information geometry ( Ohara et al. , 1990| ). The KL divergence 
is derived by relating a kernel matrix to a covariance matrix of Gaussian dis¬ 
tribution. The primal advantage is that the KL divergence allows us to use the 
em algorithm ( [Amari , 1995) to approximate an incomplete kernel matrix. The 
e and m steps are formulated as convex programming problems, and moreover 
they can be solved analytically when spectral variants are used as a parametric 
model. 

We performed bacteria clustering experiments using two marker sequences: 
16S and gyrB (Yamamoto et al., 2000|). We derived the incomplete and base 


kernel matrices from gyrB and 16S, respectively. As a result, even when 50% of 
columns/rows are missing, the clustering performance of the completed matrix 
was better than that of the base matrix, which illustrates the effectiveness of 
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our approach in real world problems. 

This paper is organized as follows: Sec. ^ introduces the information geome¬ 
try to the space of positive definite matrices. Based on geometric concepts, the 
em algorithm for matrix approximation is presented in Sec. |3], where detailed 
computations are deferred in Sec. [|. In Sec. |5| the matrix approximation prob¬ 
lem is formulated as statistical inference and the equivalence between the em 
and EM algorithms (Dempster et al.|, |1977D is shown. Then the bacteria clus¬ 


tering experiment is described in Sec. 
in Sec. ff|, we conclude the paper in Sec. 


After seeking for possible extensions 


2 Information Geometry of Positive Definite Ma¬ 
trices 


We first explain how to introduce the information geometry in the space of 
positive definite matrices. Only necessary parts of the theory will be presented 


here, so refer to (Ohara et al., 1996; Amari and Nagaoka. 2001) for details. 


Let us define the set of all d x d positive definite matrices as V. The first step 
is to relate a d x d positive definite matrix P £ V to the Gaussian distribution 
with mean 0 and covariance matrix P: 


P(X]P> ° ( 2 ^\P\W eXP( ~^ 


T P" 1 £ 


(1) 


It is well known that the Gaussian distribution belongs to the exponential family. 
The canonical form of an exponential family distribution is written as 

p(x\9) = exp(9 T r(x) — ^(fl)), 


where r(x) is the vector of sufficient statistics, 9 is the natural parameter and 
i/j(9) is the normalization factor. When (lj) is rewritten in the canonical form, 
we have the sufficient statistics as 


r(x) 



1 


di • • • ? %d—l3'd 


and the natural parameter as 

9 = (IP" 1 ]!!, • • • , IP" 1 U IP” 1 ] 12, • • ■ , [P~ Vl, d ) T , 


where [M ]denotes the (i, j) entry of matrix M. The natural parameter 9 
provides a coordinate system to specify a positive definite matrix P, which is 
called the 0-coordinate system (or the e-coordinate system). On the other hand, 
there is an alternative representation for the exponential family. Let us define 
the mean of ri(x) as ip: For example, when ri{x ) = x s xt , 


Vi = 


x s x t p{x\9)dx = P st . 
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This new set of parameters 77^ provides another coorninate system, called 77- 
coordinate system (or the m-coordinate system): 

V = (Plli ■ ■ ■ 1 Pdd, Pl2, ■ ■ ■ , Pd-l,d) ■ 

Let us consider the following curve 0{t) connecting two points 0 1 and O 2 linearly 
in 9 coordinates: 

9(t) = t{9 2 -9 l ) + 9 1 . 

When written is the matrix form, this reads 

p- i (t) = t(p 2 - 1 -pr 1 ) + p 1 - 1 . 

This curve is regarded as a straight line from the exponential viewpoint and 
is called an exponential geodesic or e-geodesic. In particular, each coordinate 
curve 8i = t, 9j = Cj (j / i) is an e-geodesic. When the e-geodesic between 
any two points in a manifold S C V is included in S, the manifold S is said to 
be e-flat. On the other hand, the mixture geodesic or m-geodesic is defined as 

V(t) =t(v2~Vi) + Vi¬ 
la the matrix form, this reads 

P(i) = f(Pj-Pi) + Pi. 


When the m-geodesic between any two points in S is included in S, the manifold 
S is said to be m-flat. 

In information geometry, the distance between probability distributions is 
defined as the Kullback-Leibler divergence (Amari and Nagaoka, 2001): 


KL(p,q)= [ p(x)log^-dx. 

J q(x) 


By relating a positive definite matrix to the covariance matrix of Gaussian (jl|), 
we have the Kullback-Leibler (KL) divergence for two matrices P, Q : 


KL(P , Q) = tr(Q 1 P) + log det Q — log det P — d. 


With respect to a manifold S C V and a point P G P, the projection from 
P to S is defined as the point in S closest to P. Since the KL divergence is 
asymmetric, there are two kinds of projection: 


• e-projection: Q* = argming g5 I\ L(Q, P). 

• m-projection: Q* = argming gS KL(P, Q). 


It is proved that the m-projection to an e-flat submanifold is unique, and e- 
projection to an m-flat manifold is unique ( Amari and Nagaoka| , [200 1| ). This 
uniqueness property means that the corresponding optimization problem is con¬ 
vex and so the global optimal solution is easily obtained by any reasonable 
method. 
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3 Approximating an Incomplete Kernel Matrix 


In this section, we describe the era algorithm to approximate an incomplete ker¬ 
nel matrix. Let Xi,..., xi £ X be the set of samples in interest. In supervised 
learning cases, this set includes both training and test sets, thus we are consid¬ 
ering the transductive setting ( Vapnik , |1998| ). Let us assume that the data is 
available for the first n samples, and unavailable for the remaining to := l — n 
samples. Denote by Kj annxn kernel matrix, which is derived from the data 
for the first n samples. Then, an incomplete kernel matrix is described as 


D = 


Ki D vh \ 
D T vh D hh ) ’ 


( 2 ) 


where D v h is an n x m matrix and Dhh is an to X to symmetric matrix. Since 
D has missing entries, it cannot be presented as a point in V. Instead, all the 
possible kernel matrices form a manifold 


V = {D | D vh € 5R nxm , Dhh S 5i mxm , D hh = Dj h , D y 0}, 


where D y 0 means that D is positive definite. We call it the data manifold as 
in the conventional EM algorithm (Ikeda et al., 1999). It is easy to verify that 
D is an m-flat manifold; hence, the e-projection to T> is unique. 

Next let us define the parametric model to approximate D. Here the model 
is derived as the spectral variants of Kg, which is an (. x £ base kernel matrix 
derived from another information source. Let us decompose I\b as 


i 

K b = 5Z A iVivJ, 

i =1 


where and Vi is the i-th eigenvalue and eigenvector, respectively. Define 


Mi = viv 


T 

i 5 


(3) 


then all the spectral variants are represented as 

£ 

M = {M | M = (3e 3^, My 0} 

i=i 


We call it the model manifold (Ikeda et al., 199E). For notational simplicity, we 
choose a different parametrization of M.: 


M = {M\M={J2 bjM 0 )-\ b e Sft^, My 0}, 

j=i 


(4) 


where bj = 1 /(3j. It is easily seen that the manifold M. is e-flat and m-flat at 
the same time. Such a manifold is called dually-flat. 
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Figure 1: Information geometric picture of the em algorithm. The data manifold 
V corresponds to the set of all completed matrices, whereas the model manifold 
A4 corresponds to the set of all spectral variants of a base matrix. The nearest 
points are found by gradually minimizing the KL divergence by repeating e and 
m projections. 


Our approximation problem is formulated as finding the nearest points in 
two manifolds: Find D £ V and M € A4 to minimize I\L{D,M). In geometric 
terms, this problem is to find the nearest points between e-flat and m-flat mani¬ 
folds. It is well known that such a problem is solved by an alternating procedure 
called the em algorithm (Amari, 1995). The em algorithm gradually minimizes 
the KL divergence by repeating e-step and m-step alternately (Fig. [j]). 

In the e-step, the following optimization problem is solved with fixing M: 
Find D G V that minimizes KL(D , M). This is rewritten as follows: Find D v h 
and Dhh that minimize 


L e =tr(DM x ) — logdet-D, (5) 

subject to the constraint that D >- 0. Notice that this constraint is not needed, 
because 

t 

log det D = y>g/,„ 

i=l 

where fii is the i -th eigenvalue of D. Here log det D is undefined when one of 
eigenvalues is negative, and log det D decreases to — oo as an eigenvalue get closer 
to 0. So, at the optimal solution, D is necessarily positive definite, because the 
KL divergence is infinite otherwise. As indicated by information geometry, this 
is a convex problem, which can readily be solved by any reasonable optimizer. 
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Moreover the solution is obtained in a closed form: Let us partition M 1 as 


The solution of 


— 1 ( S vv S v h \ 

"(si s„J- 

(6) 

is described as 


D vh = -KiS vh S^, 

(7) 

Dhh = S hh + S hh S v hKjS v hS hh . 

(8) 


The derivation of (]7j) and (|8|) will be described in Sec. |4.1 . 

In the m-step, the following optimization problem is solved with fixing D: 
Find M G M that minimizes KL(D, AI). This is rewritten as follows: Find 
b G Jr that minimizes 


e i 

L m = ^ bj tr (MjD) — logdet(^ bjMj) 
j =i 3 =i 


(9) 


subject to the constraint that i bjMj >- 0. Notice that this constraint can 
be ignored as well. When {Mj}^ =1 are defined as (||), the closed form solution 
of (||) is obtained as 


bi = 1/ tr(MjD), i = 


The derivation of (0) will be described in Sec. [O] 


( 10 ) 


4 Computing Projections 

This section presents the derivation of e and m-projections in detail. 

4.1 e-projection 

First we will show the derivation of e-projection (7) and (||). The log determi¬ 
nant of a partitioned matrix is rewritten as 

L e = tr(.DM _1 ) — logdet D 

= tr (DM -1 ) - logdet Kj - logdet (D hh - Dj h Kj 1 D vh ). 

When we partition M -1 as (j(|), it turns out that 

L e = tr(D vv S vv )+2tr(D vh S vh )+tr(D h hShh)-logdet K I -\ogdet(D hh -Dj h Ky 1 D vh ). 

( 11 ) 

The saddle point equation with respect to Dhh is obtained as 

dL 

vrpj = S hh — (Dhh — D vh Ki D v h) , (12) 
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because log det C = C 1 for any symmetric matrix C. Solving with 
respect to Dhh, we have 


D hh = S^ + Dj lh Kj 1 D vh . (13) 

Substituting (pjj) into (pi]), we have 

L e = tr(D vv S vv )+2ti(D vh S vh )+ti:(I+S h hDj h Kj 1 D vh )-\ogdetK I -\ogdetS h h. 

Now the saddle point equation with respect to D v h is obtained as 

77 tt — = 2 S v h + 2 Kj 1 D vh Shh = 0 . 
oD vh 

Solving this equation, we have the solution (Jq) for D v h. By substituting (0) into 
& we have the solution (|j) for Dhh- 


4.2 m-projection 

Next, we will show the derivation of m-projection (|loj). The in projection is 
obtained as the solution b to minimize 

t i 

L m = ^ bj tr (MjD) — logdet(0 bjMj). 

3 =1 3=1 

Since 9 log det Q~ 3 /dQ = Q, the saddle point equations are described as 

C 

tr(M,(y~] bjMj) -1 ) = tr(MjD), i = (14) 

3=1 

Remembering that Mj = VjvJ , we have 

J- 

3=1 3 = 1 3 

Since the left hand side of @) is 

, -r 1 T\ / 1 T\ 1 

tr (viVi 2^ y v 3 v j ) = trl-ViVi ) = -, 

3=1 1 1 1 

the solution of © is analytically obtained as 

bi = l/tr(M i D), i = 1, 

We have shown that the m-projection is obtained analytically when the model 
manifold corresponds to spectral variants of a matrix. However, it is not always 
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the case. For example, consider we have c base matrices N\,..., N c and the 
model manifold is constructed as harmonic mixture of them: 

C 

M = {M\M = C^2b j N j )~ 1 , ber, My 0}. (15) 

l=i 


This is an e-flat manifold so the optimization problem is convex, but the analyti- 
cal solvability depends on geometric properties of base matrices {-/Vj|)Li (Ohara, 


1999). We will briefly discuss this issue in the Appendix. 


5 Relation to the EM algorithm 


In statistical inference with missing data, the EM algorithm (Dempster et al. 


1977) is commonly used. By posing the matrix approximation problem as sta¬ 


tistical inference, the EM algorithm can be applied, and — as shown later 
it eventually leads to the same procedure. In a sense, it is misleading to relate 
matrix approximation to statistical concepts, such as random variables, obser¬ 
vations and so on. Nevertheless it would be meaningful to rewrite our method 
in terms of statistical concepts for establishing connections to other literature. 

Let v and h be the n and m dimensional visible and hidden variables. From 
observed dataQ, the covariance matrix of v is known as 


E 0 [vv ' ] = Ki, 


where E a denotes the expectation with respect to observed data. However, we 
do not know the covariances D v h = E 0 [vh T ) and Dhh = E 0 [hh r }. Our purpose 
is to obtain the maximum likelihood estimate of parameter b of the following 
Gaussian model: 


P{v,h\b) 


1 ( 1 

(2 7 r) d / 2 |M| 1 /2 exp ^ 2 


V 

h 


T 

M _1 


V 

h 


where M is described as ([|). In the course of maximum likelihood estimation, 
we have to estimate the observed covariances D v h and D in an appropriate 
way. The EM algorithm consists of the following two steps. 


• E-Step: Fix b and update D v h and D^h by conditional expectation. 

• M-Step: Fix D and update b by maximum likelihood estimation. 


It is shown that the likel ihood of observ e d dat a increases monotonically by 
repeating these two steps ( Dempster et al. , 1977 ). 

The M-step maximizes the likelihood, which is easily seen to be equivalent to 
minimizing the KL divergence (Arnari, 1995). So the M-step is equivalent to the 
m-step (§• However, the equivalence between E-step and e-step is not obvious, 

Hn fact, we do not have observed data in any sense. However, we assumed them as a 
matter of form for relating em and EM. 
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because the former is based on conditional expectation and the latter minimizes 
the KL divergence. In the E-step, the covariance matrices are computed from 
the conditional distribution described as 


p(%> b ) = (2 7 r )m/ 2 ] > g-i|i/2 exp (~^( h + S hh S vhv) T Shh{h + S h £sj h v)j , 

where S matrices are derived as (|]). Taking expectation with this distribution, 
we have 

E b [vh T | u] = -w T S vh S bb , 

E b [hh T \v} = S^ + S^Sj h w T S vh S^. 

Then the covariance matrices are estimated as 


D vh = E Q E b [vh T | v] = -KjSyhS^, 

Dhh = E 0 E b [hh | u] = S hh + S hh S vh KiS v hS hh . 


Since these solutions are equivalent to (0) and (|j), respecti vely, th e E-ste p is 
shown to be equivalent to the e-step in this case. Refer to Amari (1995) for 
general discussion of the equivalence between EM and em algorithms. 


6 Bacteria Classification Experiment 


In this section, we perform unsupervised classification experiments for bacteria 
based on two marker sequences: 16S and gyrB. Basically we would like to iden¬ 
tify the genus of a bacterium by means of extracted entities from the cell. It is 
known that several specific proteins and RNAs can be used for genus identifi¬ 
cation (Kasai et al., 1998[ ). Among them, we especially focus on 16S rRNA and 
gyrase subunit B (gyrB) protein. 16S rRNA is an essential constituent in all liv¬ 
ing organisms, and the existence of many conserved regions in the rRNA genes 
allows the alignment of their sequences derived from distantly related organ¬ 
isms, while their variable regions are useful for the distinction of closely related 
organisms. GyrB is a type II DNA topoisomerase which is an enzyme that 
controls and modifies the topological states of DNA supercoils. This protein is 
known to be well preserved over evolutional history among bacterial organisms 


sai et al. 


thus is supposed to be a better identifier than the traditional 16S rRNA (Ka- 
199S|). Notice that 16S is represented as a nucleotide sequence with 


4 symbols, and gyrB is an amino acid sequence with 20 symbols. Since gyrB 
has been found to be useful more recently than 16S (Yamamoto et al., 2000|), 


gyrB sequences are available only for a limited number of bacteria. Thus, it is 
considered that gyrB is more “expensive” than 16S. 

Our dataset has 52 bacteria of three genera ( Corynebacterium: 10, Mycobac¬ 
terium: 31, Rhodococcus: 11), each of which has both 16S and gyrB sequences. 
For simplicity, let us call these genera as class 1-3, respectively. For 16S and 
gyrB, we computed the second order count kernel, which is the dot product of 
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bimer counts (Lsuda et al., 2002). Each kernel matrix is normalized such that 
the norm of each sample in the feature space becomes one. The kernel matrices 
of gyrB and 16S can be seen in Fig. || (b) and (c), respectively. For reference, 
we show an ideal matrix as Fig. ^j(a), which indicates the true classes. In our 
senario, for a considerble number of bacteria, gyrB sequences are not available 
as in Fig. ||(d). We will complete the missing entries by the era algorithm with 
the spectral variants of 16S matrix. When the era algorithm converges, we end 
up with two matrices: the completed matrix on data manifold T> (Fig. |^(e)) and 
the estimated matrix on model manifold M (Fig. ^](f)). These two matrices are 
in general not the same, because the two manifolds may not have intersection. 

In order to evaluate the quality of completed and estimated matrices, K- 
rneans clustering is performed in the feature space of each kernel. In evaluating 
the partition, we use the Adjusted Rand Index (ARI) ( Hubert and Arabie , 1985| ; 
Yeung and Ruzzcj . |200l| ). Let Ui ,..., U c be the obtained clusters and T \,..., T s 
be the ground truth clusters. Let riij be the number of samples which belongs 
to both Ui and Tj. Also let ni, and n,j be the number of samples in Ui and Tj , 
respectively. ARI is defined as 


( n ij ^ 

\ 2 J 

Tn 

( n 

l 2 

) Ej (»j2) 

'Q 

1 

2 

Ei (" 2 ) + ft 

01 

- 

E, (ft) E 

ft')] 

'Q 


The attractive point of ARI is that it can measure the difference of two parti¬ 
tions even when the number of clusters is different. When the two partitions 
are exactly the same, ARI is 1, and the expected value of ARI over random 
partitions is 0 (see Hubert and Arabic (1985) for details). 

The clustering experiment is performed by randomly removing samples from 
gyrB data. The ratio of missing samples is changed from 0% to 90%. The 
ARIs of completed and estimated matrices averaged over 20 trials are shown 
in Fig. [l] and [|, respectively. Comparing the two matrices, the estimated ma¬ 
trix performed significantly worse than the complete matrix. It is because the 
completed matrix maintains existing entries unchanged, and so the class infor¬ 
mation in gyrB matrix is well preserved. We especially focus on the comparison 
between the completed matrix and 16S matrix, because there is no point in 
performing the era algorithm when 16S matrix works better than the completed 
matrix. According to the plot, the ARI of completed matrix was larger than 
16S matrix up to 50% missing ratio. It implies that the matrix completion is 
meaningful even in quite hard situations — 50% sample loss implies 75% loss 
in entries. This result encourages us (and hop efully readers) to apply th e era 
algorithm to other data such as gene networks (Vert and Kanehisa, |2002|) . 


7 Possible Extension 

As we related the era algorithm to maximum likelihood inference in Sec. |B|, it is 
straightforward to generalize it to the maximum a posteriori (MAP) inference 
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(a) Ideal 


(b) gyrB (complete) 







Figure 3: Clustering performance of the completed matrix. The solid curve 
shows the averaged ARI of the completed matrix, and the error bar describes 
the standard deviation. The upper and lower flat lines show the ARIs of the 
complete gyrB and 16S kernel matrices, respectively. 



Figure 4: Clustering performance of the estimated matrix. The solid curve 
shows the averaged ARI of the estimated matrix, and the error bar describes 
the standard deviation. 
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or more generally the Bayes inference (Robert, 1994). For example, we are going 
to modify the em algorithm to obtain the MAP estimate. The MAP estimation 
amounts to minimizing the KL divergence penalized by a prior, 

KL(D, M) - log?r(M), Del), M e M, 

where 7 r(M) is a prior distribution for M. Since the additional term — log7r(M) 
depends only on the model M, only the m-step is changed so as to minimize 
the above objective function with respect to M. 

Let us give a simple example of MAP estimation in the spectral variants 
case. In Bayesian inference, it is common to take a conjugate prior , so that 
the posterior distribution remains as a member of the exponential family. Since 
the model parameter b is related to a covariance matrix, we choose the Gamma 
distribution, which works as a conjugate prior for the variance of Gaussian 
distribution (Robert, 1994). The prior distribution is defined independently for 
each bj as 


■n{bj\ v,a) = 


1 


exp 


— — + {v - 1) log A 
a 


where v and a denote hyperparameters, by which the mean and the variance 
are specified by E(bj ) = cuv and V(bj) = a 2 v. The m step for MAP estimation 
is to minimize 


-MAP 


= L m - ^log7r(&j-; v,a), 
j=1 


which leads to the equation 


v — \ 

tr (Mi(^2 bjMj ) _1 ) + —— = tr(MiD) 
3 =1 ' 


1 

1 

a 




In the spectral variants case, the left hand side is reduced to p/6j, thus we obtain 
the MAP solution in a closed form as 


bi = 


tr (MiD) + 1 /a 1 


i = 


8 Conclusion 


In this paper, we introduced the information geometry in the space of kernel 
matrices, and applied the em algorithm in matrix approximation. The main dif¬ 
ference to other Euclidean methods is that we use the KL divergence. In general, 
we cannot determine which distance is better, because it is highly data depen¬ 
dent. However our method has a great utility, because it can be implemented 
only with algebraic computation and we do not need any specialized optimizer 
such as semidefinite programmming unlike (Graepej, 2002; Cristianini et al. 


2002; Lanckriet et al., 2002). 


One of our contribution is that we related matrix approximation to statistical 
inference in Sec. |^. Thus, in future works, it would be interesting to involve 
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advanced methods in statistical inference, such as generalized EM (Dempster 


et ah, 1977) and variational Bayes (Attias, 1999). Also we are looking forward 


to apply our method to diverse kinds of real data which are not limited to 
bioinformatics. 


A Analytical Solvability of the m-step 


In this appendix, we discuss the solvability of the m-step. The left hand side of 
(0 is the m-coordinate of the submanifold M ., while bj denote the e-coordinate 
of M. The e-coordinate and m-coo rdinate are connected by the Legendre trans¬ 
form ( Amari and Nagaoka , 2001 ). In the mother manifold V, the Legendre 
transform is easily obtained as the inverse of the matrix. In the submanifold Ai 
of P, however, it is difficult to obtain the Legendre transform in general. The 
difficulty is caused by the difference of geodesics defined in M and V. When 
the geodesic defined by a coordinate system of a submanifold S C V coincides 
the geodesic defined by the corresponding global coordinate system of V, the 
submanifold is called autoparallel. In our case, A4 is autoparallel for the e- 
coordinate, but it is not always autoparallel for the m-coordinate. When the 
submanifold is autoparallel for the both coordinate systems, the submanifold is 
called doubly autoparallel. 

Let us consider when a submanifold becomes doubly autoparallel. To begin 
with, let us define the product * between two d x d symmetric matrices X, Y G 
Sym(d), 

X*Y = -(XY + YX). (16) 


The algebra equipped with the usual matrix sum and the product (|16|) is called 
the Jordan algebra of the vector space of Sym(d). The following theorem pro¬ 
vides the necessary and sufficient condition for doubly autoparallel submanifold. 


Theorem 1 ( Ohara ( 1999j ), Theorem 4.6). Assume the identity matrix I 
is an element of the submanifold A4, Then A4 is doubly autoparallel if and only 
if the tangent space of M. is a Jordan subalgebra of Sym(d). 

When a submanifold A4 C V is determined as (0) , M is doubly autoparallel 
if the following holds for all i,j: 

Ni * Nj G span({V 1 ,..., N c }). 


Ohara (1999) has shown that, if and only if M is doubly autoparallel, the m- 
projection can be solved analytically, that is, the optimal solution is obtained 
by one Newton step. For example, in the spectral variants case, N t = VtvJ and 

Ni* Nj = 0 G span({7Vi,..., N c }). 

Thus the m-projection is obtained analytically in this case. 
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